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Abstract 

Given a model in algebraic statistics and data, the likelihood func- 
tion is a rational function on a projective variety. Algebraic algorithms 
are presented for computing all critical points of this function, with 
the aim of identifying the local maxima in the probability simplex. 
Applications include models specified by rank conditions on matrices 
and the Jukes-Cantor models of phylogenetics. The maximum likeli- 
hood degree of a generic complete intersection is also determined. 

1 Introduction 

A model in algebraic statistics is specified by a polynomial map from the 
space of model parameters to the space of the joint probability distributions 
of the observed discrete random variables. Maximum likelihood estimation 
is concerned with finding those model parameters that best explain a given 
sequence of observations. This is done by maximizing the likelihood function. 
The likelihood function is usually not convex, it can have many local maxima, 
and the problem of finding and certifying a global maximum is difficult. 

Here we consider the problem of finding all critical points of the likelihood 
function, with the aim of identifying all local maxima. The defining equations 
of the critical points are the likelihood equations. The number of complex 
solutions to the likelihood equations (for generic data) is called the maximum 
likelihood (ML) degree of the model. A geometric study of the ML degree was 
undertaken in our joint work with Fabrizio Catanese jB|. The present paper 
offers algebraic algorithms for deriving and solving the likelihood equations. 

We begin by illustrating the problem and our solution for a simple exam- 
ple. In a certain game of chance, a gambler tosses the same coin four times in 
a row, and the number of times heads come up are recorded. Hence the possi- 
ble outcomes are 0, 1, 2, 3, or 4. We observe 1000 rounds of this game, and we 



1 



record the outcomes in the data vector u = (-Uq, ^i, M2, M3, ^4) G N^, where 
Ui is the number of trials that had i heads. Hence uo+Ui+U2+U3+Ui = 1000. 
Suppose we are led to suspect that the gambler uses two biased coins, one in 
each of his sleeves, and he picks the coin to be used at random (with prob- 
abilities Tc and 1 — tt) prior to each round. We wish to test this hypothesis 
using the data u. 

Our model is the mixture of a pair of four-times repeated Bernoulli trials. 
The mixing parameter tt is the probability that the gambler picks the coin 
in his left sleeve. The bias of the left coin is s, and the bias of the right coin 
is t. Our model stipulates that the probabilities of the five outcomes are 

Po = 7r(l-s)4 + (l-v^)(l-^)^ 

pi = 47rs(l-s)=^ + 4(1 -7r)t(l 

P2 = 67rs2(l-s)2 + 6(1 -7r)t2(l -t)2, 

P3 = A7rs'\l-s) + 4(1 -7r)t3(l -t), 

P4 = VTS^ + (l-7r)t^ 

The polynomial pi represents the probability of seeing i heads in a round. 
The likelihood of observing the data u when 1000 trials are made equals 

pTpTpTpTpT (1) 



Uq\UiIU21U31U41 



Maximum likelihood estimation means maximizing subject to < tt, s, t < 
1. The critical equations for this unconstrained optimization problem have 
infinitely many solutions: there is a curve of critical points in the s = t plane. 

In order to avoid such non-identifiability, we reformulate our maximum 
likelihood computation as the following constrained optimization problem: 

Maximize pI°pTpTpTpT subject to det(P) = and poH VPi = 1, (2) 



where 



12j9o 3pi 2p2 
3pi 2p2 3p3 
2p2 3p3 12p4 



The image of the map (vr, s,t) {po,Pi,P2,P3,P4.) over the complex numbers 
is the hypersurface {det(P) = 0} in projective 4-space. Using Algorithm 
we find that the ML degree of this model is 12, i.e., the solution of problem 
leads to an algebraic equation of degree 12. See Examples lUl and ITUl 
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This paper is organized as follows. In Section 2 we introduce the likelihood 
equations associated with an arbitrary projective variety V. The ML degree 
of V is defined as the number of complex solutions to the likelihood equations. 

Section 3 contains an algebraic geometry result. An explicit formula is 
given for the ML degree of a generic complete intersection. This formula is 
an upper bound for the ML degree of more special complete intersections. 

In Section 4 we present an algorithm whose input is an arbitrary homo- 
geneous ideal in a polynomial ring, representing a projective variety V. The 
algorithm uses linear algebra over the coordinate ring R[V^] to find the likeli- 
hood ideal. This ideal typically has finitely many complex solutions. We also 
discuss our test implementation in Singular [13j. It computes all solutions 
numerically and identifies the local maxima in the probability simplex. 

Section 5 comprises an experimental study of the ML degree and number 
of local maxima for various determinantal models, including the one discussed 
above. It is important to note that, in the context of algebraic statistics, 
every variety V comes with a fixed coordinate system. We demonstrate that 
the ML degree is extremely sensitive to changes of coordinates, even just 
scaling of the coordinates. The good news is that in each case the ML degree 
appears to be smallest for the statistically meaningful coordinate system. 

In Section 6 we apply our results to a class of models widely used in 
computational biology: Jukes-Cantor models for phylogenetic trees [H E] . 

The setup of Sections 2-4 assumes that the defining ideal of the model 
V is known. If this ideal is not known and impossible to compute, then 
we are confined to use the (generally less efficient) parametric version of the 
likelihood equations which are discussed in Section 7. In that section we also 
prove that the parametric ML degree (which is the quantity emphasized in 
[S]) equals the implicit ML degree times the cardinality of a generic fiber. 

2 Likelihood Locus on a Projective Variety 

We consider a statistical model which is a subset of the probability simplex 

A„ = {{po,Pi, ...,Pn)e W'^'^ : po, • • • ,Pn > and Po+PiH \-Pn = l}, 

and we assume that the model is presented as the solution set in A„ of a 
system of homogeneous polynomial equations in the unknowns po,pi, . . . ,pn- 
Such polynomials are known as model invariants in the literature on phy- 
logenetics and algebraic statistics |16j. We write V for the Zariski closure 
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of the model in complex projective space P". Equivalently, V is the set of 
all complex solutions to the given homogeneous polynomial equations. The 
maximum likelihood problem is to find a point p = {po : ■ ■ ■ : pn) in the model 



V- 



>o 
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which "best explains" a given data vector (mo,...,m„) G N"^"^. As in (j2 
above, this means solving the following constrained optimization problem: 



Maximize L 



PO Pi Pn 



{po +Pl + 



Pr. 



\Uo+Ul-\ \~u„ 



subject to p G V">o. (3) 



Our approach is to compute all complex critical points of the likelihood 
function L and to extract the positive real solutions that are local maxima. 
While the optimization problem Q requires the Pi to be real and positive, we 
shall compute all the critical points on the complex projective variety V. Let 
Vsing denote the singular locus of the variety V and set Vreg '■= V\Vsing- Let 
P be the homogeneous ideal in the polynomial ring M.[po, pi, . . . ,pn] generated 
by the defining polynomials of V. All computations in the coordinate ring 



i[V] 



o,Pi, 



,Pn]/P 



will be made using standard techniques of Grobner basis theory [7| I12j. 

Definition 1. Let U be the open subset Keg\V(po ■ ■ -Pn-iJ^Pi)) of C P". 
The likelihood locus Zu is the set of points p & U such that dL = 0. The 
likelihood ideal lu C M[y] is the ideal of the Zariski closure of Zu in V. 

We note that this definition differs from the one given in j2j where we also 
included the critical points in V\U and we counted them with multiplities. 

Let {gi,g2, ■ ■ ■ ,gr} be a set of homogeneous polynomials generating the 
ideal P. We consider the Jacobian matrix augmented by a row of ones: 



J 



1 1 

dgi/dpo dgi/dpi 
dg2/dpo dg2/dpi 



1 

dgi/dpn 
dg2/dpn 



(4) 



dgr I dpo dgr /dpi ■■■ dgr/ dpn 
We multiply J by the diagonal matrix whose entries are the unknowns to get 

J = J ■ diag(po,Pi, ■ ■ ■ ,Pn)- 
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Proposition 2. A point p is in the likelihood locus if and only if the 
data vector {uq, . . . , Un) is in the image of the transpose matrix d{p)- 

Proof. Let Kfr be the affine subvariety of <C^^^ defined by P + ~ !)■ 

The Jacobian of K,fj is the matrix J. The likelihood function L has no poles 
or zeros on W, so the critical points of L are the same as the critical points 
of log(L) = Ui log Pi on Kff. A point p G W is a critical point of log(L) if 
and only if d\og{L){p) = . . . , ^) is in the image of J^{p). As pj 7^ on 

U, this is equivalent to m = (mq, • • • , Un) being in the image of j{p)- □ 

Our algorithm for computing the likelihood ideal /„ will be derived from 
Proposition |21 First, however, let us show that /„ is always artinian for 
generic u. Hence the colength of lu is constant for almost all data u. This 
number is the maximum likelihood (ML) degree of the projective variety V. 

Proposition 3. Let I C W x P" be the incidence variety consisting of pairs 
(p, u) where p E Z^- Then X is the projectivization of a vector bundle overlA 
and dimX = n. In particular, Z^ is either empty or finite for generic u. 

Proof. Let c be the codimension of V . For every p eU the matrix J(p) and 
hence the matrix J(p) with their first rows removed have rank c. Multiplying 
J by the vector of ones yields (^Pi, 0, . . . , 0). In particular, for any p E U, 
the first row is linearly independent of the remaining rows, and J(p) has 
rank c + 1 . Thus the set of all u in the image of (p) is a vector space of 
dimension c + 1, and hence X is the projectivization of a vector bundle of 
rank c + 1 over U. It follows that dimX = dimU + c = n. Projecting onto the 
second factor, the generic fiber must either be empty or of dimension 0. □ 

Example 4. Let n = 2 and P = (Po +Pi +P2 ~ 2poPi — 2poP2 — 2piP2). The 
model ^ is a circle in the triangle A2 which is tangent to the three edges of 
5A2. The critical ideal /„ C MJ\V] contains the cubic polynomial 

M2P0P1 - UipIp2 - M2P0P1 + UiPqpI + Uqp\p2 - UqPipI. (5) 

If uo,ui,U2 are distinct, then this cubic curve intersects the circle in six 
points, but only three of them lie in U, which is the part of the circle in the 
interior of the triangle A2. The ML degree of the circle V is three. Hence 
our problem (jS)) can be solved in terms of radicals: use Cardano 's formula to 
express each of the three points in Zu as a function of the data uq, ui, U2- □ 

In Example IH the incidence variety X is the surface in W x defined by 
which is regarded as a bihomogeneous equation of degree (3, 1) in (p, u). 
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3 Complete Intersections 



Here we consider the case when our model V C P" is a complete intersection. 
This means that the codimension c of V coincides with the number r of 
generators of the ideal P. As before, we write P = {gi,g2, ■ ■ ■ ,gr)- Let di 
be the degree of the homogeneous polynomial Qi. Let D denote the sum of all 
monomials of degree at most n — r in r unknowns evaluated a.t di,d2, ... ,dr: 

D = J2 (6) 

ii+i2-\ \-ir<n—r 

Theorem 5. The ML degree of the model V is bounded above by Ddid2 ■ ■ - dr. 
Equality holds when V is a generic complete intersection, that is, when the 
coefficients of the defining polynomials gi, g2, . . . ,gr are chosen at random. 

To illustrate this formula, let us consider some special cases. First, sup- 
pose that our model is a hypersurface (r = 1) defined by one homogeneous 
polynomial g = gi oi degree d = di. Then the ML degree of V is at most 

(i" - 1 

d-D = ^"TTY- (7) 

In Example |3J we considered the case of a quadric in the plane {d = n = 
2) having ML degree three. The upper bound ((Tj) equals six, and this is 
indeed the ML degree of a general quadric. Two special quadrics of statistical 
interest are the Hardy- Weinberg curve p\ = 4poP2 and its cousin pf = poP2- 
The ML degrees of these two special models are one and two respectively. 

Another noteworthy special case arises when ^ is a linear space of codi- 
mension r in P", i.e., di = ■ ■ ■ = dr = 1. Here the open set U is the 
(complexified) complement of an arrangement of + 1 hyperplanes in M"~^, 
and the ML degree equals the number of bounded regions of the (real) ar- 
rangement [T, §4]. If V is generic then the number of bounded regions equals 

An important statistical application of such linear models is discussed in [T]. 

Proof of Theorem\^ We first consider the case when the gi are generic forms 
and u is generic. By Bertini's Theorem, the generic complete intersection V 
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is smooth. All critical points of the likelihood function L on lie in the dense 
open subset W, and the set of critical points is finite, by Proposition El 
Consider the following (r+2) x (n + l)-matrix with entries in M[po; • • • ;Pn]- 









Ui 


Un 


u 

J 




Po 


Pi 

pf. ■ 


Pn 

P"-dpn 

. . r, ^92 

P'^dpn 






. Pok 


^ife ■ 


P^-dp^ 



Let W denote the determinantal variety in P" given by the vanishing of its 
(r + 2) X (r + 2) minors. The codimension of W is at most n — r, which is a 
general upper bound for ideals of maximal minors, and hence the dimension 
of W is at most r. Our genericity assumptions ensure that the matrix J[p) 
has maximal row rank r + 1 for all p & V. Hence a point p (zV lies in W if 
and only if the vector u is in the row span of J{p)- Proposition |21 implies 

z„ = u f^W = V nw. 

Since Zu is finite and V has dimension n — r, we conclude that W has the 
maximum possible codimension, namely n — r, and that the intersection of V 
with the determinantal variety W is transversal. We note that W is Cohen- 
Macaulay, since W has maximal codimension n — r, and ideals of minors of 
generic matrices are Cohen-Macaulay. Bezout's Theorem [01 §8.4] implies 

ML degree = degree(V") ■ degree(W) = di - ■ -dr ■ degree(iy). 

The degree of the determinantal variety W equals the degree of the determi- 
nantal variety given by generic forms of the same row degrees. A special case 
of the Thom-Porteous-Giambelli formula [HI §14.4] states that this degree is 
the complete homogeneous symmetric function of degree codim(14^) = n — r 
evaluated at the row degrees of the matrix. Here, the row degrees are 
0, 1, di, . . . , dr, and the value of that symmetric function is precisely D. We 
conclude that degree (VT) = D. This completes the proof that the ML degree 
of the generic complete intersection V = V{gi, . . . ,gr) equals D ■ did2 ■ ■ ■ dn- 
Suppose now that the Qi are no longer generic. The ML degree of = 
V((7i, . . . ,gr) remains finite by Proposition El The deformation argument in 
|3i Theorem 22] implies that the ML degree of V is at most D ■ did2 ■ ■ ■ dn- □ 



7 



4 Algorithms and Implementation 



We propose the following algorithm for deriving the likelihood equations. 

Algorithm 6. (Computing the likelihood equations) 

Input: A homogeneous ideal P C ffi[]5o, ■ ■ ■ ,Pn] and a vector u G N""*"^. 

Output: The likelihoood ideal of the model V = V{P) for the data u. 

Step 1: Compute c = codim(V^). Let Q be the ideal of the singular locus of 
V, i.e., Q is generated by the c x c minors of the Jacobian matrix of P. 
Step 2: Compute the kernel M of the matrix J over R[V] = R[po,..,pn]/ P- 
Step 3: Let be the ideal in M.[V] generated by the polynomials ^"^g '^i'^i^ 
where the vectors {(po, . . . , 0„) run over a generating set of the module M. 
Step 4: The ideal equals the saturation (/^ : (Po " " -PnC^Pi) ■ Q)°°)- 

Proof of correctness. By Proposition |21 a point p (zU lies in if and only 
if u ■ (j){p) = for every in the kernel of J{p)- Since J{p) has constant 
rank for allp G U, generators of the vector space kernelc( J(j9)) are gotten by 
specializing generators of the module M = kernel]R[v']( J). This shows that 
the ideal vanishes on Z^- Now, let / be a polynomial in the saturation 
of Step 4, i.e. f ■ {po ■ ■ ■ Pn ■ qY ^ I'u fo'^ some g & Q and /c G N. Since this 
product vanishes on Z^, the polynomial / vanishes on Z^, and hence / G /„. 

Conversely, for any g & Q, the module M has a free basis over the local- 
ization M[l^]g.pQ...p^. Any element / of is a linear combination of the dot 
product of u with these free generators with coefficients in By 
clearing denominators we get a polynomial which is a polynomial linear com- 
bination of the generators of This shows that / is in the saturation. □ 

Remark 7. The ML degree of V is computed by running Algorithm IHl for a 
generic vector u G W^~^^. We simply output the colength of J„ after Step 4. 

A key feature of Algorithm IHl is that Step 1 and Step 2 are independent 
of the data u, so they need to be run only once per model. Moreover, these 
preprocessing steps can be enhanced by doing the saturation of Step 4 already 
once at the level of the module M, i.e., after Step 2 one can replace M by 

M := {M:{po---pn-Qr) = R[V]g.p,...p„ ■ M n R[Vr+\ 

For any particular data vector u G N"'*'^, one can then use either M or M in 
Step 3 to define I'^. The remaining saturation in Step 4 requires some tricks 
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in order to run efficiently. We found that, for many models and most data, it 
suffices to saturate only once with respect to a single polynomial, as follows: 

Step 4': Pick a random (c + 1) x (c + l)-submatrix of J and let h be its 
determinant. With some luck, the likelihood ideal /„ will be equal to (/^ : h). 

Here is one more useful variant. When V is a complete intersection, one 
can jump directly to Step 3 and replace by the determinantal variety W 
in the proof of Theorem El Thus, instead of we simply take the ideal of 

u 



(r + 2) X (r + 2) minors of the matrix 



J 



This variant is usually slower than 



Algorithm ini but it is sometimes faster when the codimension r is small. 

Here is one more comment concerning Step 2. Suppose our computer 
algebra system does not support linear algebra over quotient rings (such as 
]R[l^]). Then we can implement Step 2 over the polynomial ring M[po, • • • ,Pn] 
as follows. Instead of computing the kernel of the (r + 1) x (n + l)-matrix J, 
we compute the kernel of the (r+1) x (n + 1 + r + r^)-matrix [JIG], where 



G 



9i 








9r 

91 






9r 








9i 






9r 



Take the ffist n + 1 coordinates from the generators of the kernel of [ J | G ] . 
These vectors generate the module M, so they can be used in Step 3. 

Recall that our objective is to compute maximum likelihood estimates. 

Algorithm 8. (Computing the local maxima of the likelihood function) 
Input: The likelihood ideal lu for the model V and the data u. 
Output: The list of all local maxima for the optimization problem Q. 

Step 1: If dim(/„) = for the given data u, compute the solution set Zu 
numerically using Grobner bases and eigenvalue methods, as in [71 §2]. 

For each positive solution p* G fl V^>o perform the following steps: 
Step 2: Solve the linear system J^(p*) ■ A = m to get Lagrange multipliers 
A*. The Lagrangian C := log(L(p)) — J2l=i K9ii.'P) is a function of p. 
Step 3: Compute the Hessian H{p) of the Lagrangian C{p). Compute the 
restriction of H{p*) to the tangent space kernel( J(p*)) of V at the point p* . 
Step 4: If the restricted H{p*) in Step 3 is negative definite, then output p* 
with its log-likelihood log(L(j9*)) and the eigenvalues of the restricted H{p*). 
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We implemented Algorithms IHl and |H1 in the computer algebra package 
Singular The input is a homogeneous ideal P in a polynomial ring 
and a data vector u. The output is the ML degree and a list of all positive 
local maxima p* and their certificates, namely the (negative) eigenvalues of 
the Hessian H{p*). Step 4 of Algorithm |H1 uses the well-known second order 
optimality conditions in nonlinear optimization, see for instance 

All computational results to be reported in the following sections were 
obtained using this implementation. We also implemented Algorithm IHl in 
Macaulay 2 ^I]. This independently confirmed the reported ML degrees. 



5 Small Matrix Models 



Determinantal varieties are natural objects both in algebraic geometry and 
in statistics. In this section we discuss likelihood equations, ML degree, and 
local maxima for some models specified by rank conditions on 3 x 3 matrices. 

Example 9. Consider the mixture model for Bernoulli random variables 
discussed in the Introduction. This model is given by the determinant of 



Upo 3pi 2p2 
3pi 2p2 3p3 
2p2 3p3 12p4 



The ML degree of this model is twelve, and all twelve solutions to the critical 
equations can be real. In our experiments we found that at most six of these 
solutions are real and positive, and three of those can be local maxima. A 
data vector for which the function has three positive local maxima is 

u = (mo,mi,M2,M3,M4) = (51,18,73,25,75). 
Example 10. Consider the general 3 x 3-matrix with indeterminate entries 



P := 



Poo POI P02 
PW Pll Pl2 
P20 P2I P22 



The prime ideal of 2 x 2 minors of this matrix represents two independent 
ternary random variables. This model has ML degree one. In other words, 
the critical equations have a unique (positive) solution for a given 3x3 data 
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matrix U. This maximum likelihood estimate is a rational function in U, 
namely, it is the unique matrix of rank one with the same row and column 
sums as U. This example is an instance of a decomposable graphical model 
and it is known that the ML degree of such a model is always one |lUj . 

Continuing with our example, let P be the principal ideal generated by 
the 3x3 determinant of P. This is the mixture model for two pairs of 
independent ternary random variables. The ML degree of this mixture model 
equals 10. For a concrete numerical example consider the following data: 



The likelihood ideal lu has four imaginary zeros and six real zeros, all of which 
lie in the positive orthant. Three of these six matrices are local maxima of 
the likelihood function. We list the three local maxima together with the 
values of the likelihood function. The third matrix is the global maximum: 



.13887222 


.18906469 


.080226355 






.12570444 


.039074119 


.17195613 


, log(L) = 


-207.0295890, 


092566192 


.057575479 


.10496037 






.14622787 


.11633326 


.14560213" 






.19982703 


.046435565 


.090472102 


, log(L) = 


-202.9010713, 


011087957 


.12294546 


.12106862 






.20299213 


.11762942 


.087541717" 






.14331103 


.096617294 


.096806365 


, log(L) = 


-202.6703908. 


010839697 


.071467568 


.17279478 







As was mentioned in the Introduction, the ML degree is very sensitive 
to even slight perturbations to the "natural" coordinates of the model. To 
illustrate this, let us scale the unknowns and consider the new matrix 



where the aij are random real numbers. It turns out that the ML degree of 
the ideal of 2 x 2 minors of P' jumps to six. The ML degree of the ideal 
generated by the determinant of P' jumps from 10 to 39 after this change. □ 



U 



16 17 7 
18 3 12 
1 8 16 



P 



ttooPoo "oiPoi a02P02 

"lOPlO "iiPii ai2Pl2 
"20^20 "21^21 "22^22 
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Example 11. Consider the following symmetric 3 x 3-matrix: 

Poi 2pu pi2 

_ P02 Pl2 2p22_ 

The ideal of 2 x 2 minors of this matrix represents two independent identically 
distributed ternary random variables. This model has ML degree 1. Again, 
the mixture model for two copies of the previous model is specified by the 
determinant of P. The ML degree of this mixture model equals 6. 

Note how these ML degrees change if we replace P by the scaled matrix 



P' 



(with aij random reals). 



«00P00 ttOlPol «02P02 

aoiPoi anPn auPu 

_ao2P02 ai2Pl2 a22P22_ 

The ideal of 2 x 2 minors of P' has ML degree 4, which is the degree of 
the corresponding Veronese surface. The first secant variety of the Veronese 
surface is given by the determinant of P'. That model has ML degree 16. □ 

We close this section with a table of ML degrees for seven determinantal 
varieties. The first and second columns are Examples ITUl and ITT] respectivelv. 
The first row indicates the original ideal in the statistically natural coor- 
dinates. The second row refers to the ("scaled") ideal gotten from the first 
( "unsealed" ) ideal by generically scaling the coordinates. The column "3 x 4" 
refers to the maximal minors of a 3 x 4-matrix, "3x3com" is Example El and 
"3 X Acoin" is a similar problem where the coin is tossed five times in a row 
instead of four, and P is the ideal of the 3x3 minors of the matrix 

lOpo 2pi P2 P3 
2pi P2 P3 2p4 

_ P2 P3 2p4 10p5_ 

Finally, G{m, n) is the Pliicker ideal of the Grassmannian of m-planes in 



Model 


3x3 


3 X 3sym 


3x4 


3 X 3 com 


3 X Acoin 


G(2,4) 


G(2,5) 


unsealed 


10 


6 


26 


12 


39 


4 


22 


scaled 


39 


16 


164 


16 


54 


6 


52 



We close with two open problems, aimed at experts in enumerative geometry. 
Problem 12. Find an explanation for all the ML degrees stated above. 
Problem 13. Characterize all models whose ML degree is one. 
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6 Jukes-Cantor Models in Phylogenetics 



The study of "analytic solutions" for maximum likelihood estimation has a 
long tradition in phylogenetics , where one considers evolution models for 
DNA sequence data, and maximum likelihood is used to find the best phy- 
logenetic tree that explains the evolution of the taxa under consideration. 
Maximum likelihood is also used to estimate the branch lengths of the recon- 
structed trees. Here we examine the widely used Jukes-Cantor models, with 
emphasis on the cases studied by Chor et al. and Sainudiin |15]. 

We use the notation of Sturmfels and SuUivant first for binary data 
and later (in Example ITTj) for DNA data. Let us start out with Example 
3 in We consider any tree T with three leaves and the Jukes-Cantor 

model with unknown root distribution. This is equivalent to considering 
trees with four leaves and uniform root distribution. Each tree topology T 
specifies a model for three binary random variables. The joint probabilities 
are represented by unknowns Pijk, for i,j,k G {0,1}. The data is given as 
a 2 X 2 X 2-table u = {uijk) whose entries record the number of occurrences 
of any particular column pattern among three aligned binary sequences. We 
perform the linear change of coordinates given by the Fourier transform: 

qooo = Pooo + Pool + Poio + Poll + Pioo + Pioi + Pno + Pin, 

Qooi = Pooo - Pool + Poio - Poll + Pioo - Pioi + Pno - Pin, 

Qow = Pooo + Pool -Poio - Poll +P100 +P101 - Pno - Pin, 

qou = Pooo - Pool -Poio + Poll +P100 -Pioi - Pno + Pin, 

qioo = Pooo + Pool + Poio + Poll -Pioo -Pioi - Pno - Pin, 

qwi = Pooo - Pool + Poio - Poll - Pioo +P101 - Pno + Pin, 

quo = Pooo + Pool - Poio - Poll - Pioo -Pioi + Pno + Pin, 

gill = Pooo - Pool - Poio + Poll - Pioo +P101 + Pno -Pin- 

The advantage of this transformation is that the defining ideal P of any 
Jukes-Cantor model becomes a toric ideal in the Fourier coordinates qijk- 

Example 14. Let T = Ki ^ be the claw tree with three edges attached to 
the root. Then our model is a complete intersection of codimension 3 in P^: 

P = (%01?110 ~ Q'oooQ'iii, Q'oioQ'ioi ~ Q'OOO^lll, ^lOOQ'Oll ~ Q'oooQ'iii )• 
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Our problem is to solve the following constrained optimization problem: 

111 

maximize Y\ Y\. 11^^*^'' subject to p G V(P) and "^^Pijk = 1- 

i=0 j=0 k=0 ijk 

Algorithm El easily derives the likelihood equations, and it reports that, for 
random data u, the equations have 92 distinct complex solutions. In short, 
the Jukes-Cantor binary model on the claw tree i^i^s has ML degree 92. □ 

Example 15. Suppose that T is one of the three trivalent trees, for instance, 
the one where the leaves 1 and 2 are split from the leaf 3. This model is a 
complete intersection of codimension two. The ideal of model invariants is 

P = ( Q'ooi'Ziio — 9ooo9iii, Q'oioQ'ioi — Q'lOOQ'Oll )• 

This model has dimension 5 and ML degree 14. We found many instances 
where two of the 14 complex solutions to the likelihood equations are local 
maxima in the probability simplex Ay, thus confirming the results of jH]. 

The authors of 6J studied the three-dimensional submodels gotten by 
assuming the molecular clock hypothesis. There are two combinatorial types: 

Pfork = { QlOO " QlOl, QOU — QlOl, QOIO — QlOl, Q'OOl'ZllO — ^000^111 ) 
Pcomb = {QoIO ^ Q'lOO) QoOl "~ Q'lOO) Qoil ^ QlOly Q'lOOQ'llO ~ Q'OOO^lll )• 

The ideal Pfork has ML degree one, and the ideal Pcomb has ML degree nine. 
It was shown in that the local maximum in A7 is unique for Pcomb- CH 

Each rooted tree with leaves {1, 2, 3} is specified by its split system, which 
is a collection E of splits of the set {0, 1, 2, 3} into two non-empty parts. Here 
represents the root. The number of splits equals the dimension of the model. 
The split systems representing the trees in Example and Example El are 

= {{0,123}, {1,023}, {2, 013}, {3, 012}}, 

= {{03,12}, {0,123},{1,023}, {2,013}, {3,012}}. 

David Bryant [2] proposed to generalize phylogenetic models from trees to 
arbitrary splits graphs. Jukes-Cantor models for splits graphs are likely to 
become important for applications. Here is the simplest non-tree example: 
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Example 16. We add one more split to to get the split system 

^ = {{01, 23}, {03, 12}, {0,123}, {1,023}, {2, 013}, {3, 012}}. 
The resulting Jukes-Cantor model is a hypersurface of degree four in P^: 

P = ( Q'oooQ'oioQ'ioi'Ziii — Q'ooiQ'iioQ'iioQ'oii )• 

If we rewrite this quartic in terms of the probabilities ptjk then we get a 
polynomial with 40 terms. The ML degree of this model equals 326. Note 
that this is still a lot smaller than the upper bound 21, 844 given by (|7j). □ 

All of the phylogenetic models whose likelihood equations have been ana- 
lyzed so far assumed binary characters. For applications in biology, models on 
four character states (A, C, G and T) are more important. We next present 
a detailed analysis of the smallest non-trivial Jukes-Cantor DNA model. 

Example 17. Consider the Jukes-Cantor DNA model on a tree with three 
leaves and uniform root distribution. The number of observable states is 
43 = Q4 it turns out that there are only five distinct probabilities. 

We may assume that the tree is the claw tree -ft'1,3. The model parameters 
7i"2, are the probabilities of changing from any letter (A,C,G or T) to 
any other letter when passing from the root to the leaves 1, 2, 3. We write 
9i = 1 — Siii for the probabihty of not changing the letter. Let P123 be the 
probability of observing the same letter at all three leaves, Pij the probability 
of observing the same letter at all leaves i,j and a different one at the third 
leaf, and pdis the probability of seeing three distinct letters. Then 

P123 = 010203 + 37ri7r27r3, 

Pdis = 66'i7r27r3 + 67ri6'27r3 + 67ri7r26'3 + 6711712113, 

P12 = 3^i027r3 + 37ri7r2^3 + 67ri7r27r3, 

Pi3 = 3^i7r2^3 + 37ri6'27r3 + 67ri7r27r3, 

P23 = 37ri^2^3 + 3^i7r27r3 + 67ri7r27r3. 

Here we can either set 0i = 1 — Sni, or we can also regard (6'j : VTj) as 
homogeneous coordinates for . The above formulas define a map P^ x P^ x 
pi _^ p4^ Q^j^ model V is the image of this map. Its defining ideal equals 

P = ( Q'OOOQ'lll ~ ^Oll^lOl^llO )• 
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Here the qijk are the Fourier coordinates which are specified by 



gill = Pl23 + IPdis - Ip12 - Ip13 - Ip23 = (^1 " 7ri)(6'2 - 7r2)(6'3 - TTs) 

gilO = Pl23 - ^Pdis + Pl2 - \Pl1. - \P2Z = {dl - 7ri)(6'2 - 7r2)(6'3 + STTg) 

giOl = Pl23 - IPdis - Ip12 + Pl3 - 1^23 = (^1 - 7ri)(6'2 + 3712) (613 - TTs) 

gOll = P123 - IPdis - Ip12 - |P13 + P23 = (^1 + 37ri)(6'2 - 7r2)(6'3 - TTa) 

gooo = P123 + Pdis + P12 + Pi3 + P23 = (Oi + 37ri)(6'2 + 37r2)(6'3 + 37r3) 

Algorithm IHl reveals that the ML degree of this model equals 23. Using 
Algorithm |H1 we were able to confirm the global maximum reported in jlSl 
Section 5.2] on DNA sequence data for Chimpanzee, Gorilla, and Orangutan. 
The data used in this example is 

(mi23, Udis, ui2, ui3, U23) = (700, 7, 100, 42, 46), 

where there is a second local maximum present. Out of the 23 solutions to 
the critical equations 17 are real, and 7 are positive. Our experiments show 
that there are data for which as many as four positive local maxima exist. 

The authors of study the two-dimensional submodel gotten by assum- 
ing the molecular clock hypothesis. This is the surface in defined by 

Pdock = {QOU — QlOl 1 Q'OOOQ'lll ~ Q'OllQ'lOlQ'llO )• 

The ML degree of Pdock is 11, confirming the maple computation in |3]. □ 

7 Likelihood Equations from Parametrization 

Consider a statistical model which is given parametrically as the image of a 
polynomial map / : M'^ ^ Each coordinate fi of / is a polynomial in 

model parameters 9 = {9i, . . . ,9d), and we have /o + /i + ■ ■ ■ + /n = 1- This is 
usually the natural presentation coming from statistics, and it is the setting 
of jS]. The parametric version of is the following optimization problem: 

Maximize /o(^)"Vi(^)"^ ■ ■ ■ /n(^)"", (9) 

where u = [uo, . . . , Un) is a vector of positive integers and 9 runs over an 
open subset of W^. The critical equations for this optimization problem are 

= for, = l,...,d. (10) 
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In this section we show how to solve these equations directly. In our 
experience, the algorithms of Section 4 are generally preferable if the ideal P 
of algebraic relations among the fi is known. But sometimes the parametric 
algorithm presented below is quite useful as well. Theorem 1201 says that 
(under reasonable assumptions) both methods produce the same answer. 

We consider the Zariski open set Uf in where none of the fi are zero. 
The critical locus is defined in Uj by the vanishing of the equations (jiup . 
Let Ju be the ideal in R[^] = M[^i, . . . ,6d] whose variety is the Zariski closure 
of Zu in all of C^. We call Ju the parametric likelihood ideal for (jH)). 

Of course we can obtain J„ by computing the ideal of the numerators of 
the equations (fTUI) and then saturating by the product of the fi. This has 
the disadvantages of being quite slow in practice and requiring a separate 
computation for each choice of u. We propose the following method instead. 

Algorithm 18. (Computing the parametric likelihood equations) 

Input: Polynomials /o, •••,/« £ ^[0] with J2i fi = ^ cind a vector u G N""*""^. 

Output: Generators of the parametric likelihood ideal Ju C M.[9]. 

Step 1: Compute generators for the kernel over R[5] of the matrix 



M 



7o 

/i 




Ml . . . ^ 



^ 



dfo 



fn 



dfn 

96*1 



dfn 

ddd 



Step 2: For each generator {ipo, . . . , ipn, ^i, • • ■ , ^d)^ of kernelRjej (M) form the 
polynomial J2^=o'^i'^i- Let be the ideal generated by these polynomials. 
Step 3: The desired ideal is equal to the saturation Ju = {Ju '■ (/o/i " " " fn)°^)- 



The proof of correctness for Algorithm is straightforward using the 
setup of 0. The kernel of the matrix M is the module of logarithmic vector 
fields along the hypersurface in defined by /o/i ■ ■ ■ fn- It was shown in [3 
§7] that J'u = Ju holds under certain geometric hypotheses (namely, the map 
/ factors through a smooth variety on which the fi represent global normal 
crossing divisors). In general, we may still have to saturate by Yli fi: but the 
generators of are much closer to the ideal Ju than the numerators of ()10|) . 

Unlike the implicit setting of Section 4, the ideal Ju need not be artinian 
even if u is generic. There can be positive-dimensional components of critical 
points at the locus in ^-space where the parameterization fails to be smooth. 
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Example 19. Let d = 3,n = A and consider the example in the Introduction: 

f, = n{l-s)' + (l-7r)(l-t)\...,/4 = 7r/ + (1 - n)t\ 

The kernel of the 5 x 8-matrix M in ()11|) is minimally generated by 27 vectors 
in M[s, t,7r]*. We compute the parametric likelihood ideal J„ for generic u 
using Steps 2 and 3 of Algorithm ITSl It turns out that J„ is not artinian and 
it has four associated primes. The first is a one- dimensional component: 

{s-U,t-U), where U= /^i + + 3^3 + 4^4 

This component does not depend on vr at all: This is the unique solution of 
the maximum likelihood problem for the unmixed Bernoulli random variable. 
Next there are two components each of which contributes three critical points: 

( vr — 1, s — U, a^t^ + a2t^ + O-it + ao ) 
and ( TT , t — U, ass^ + ^25^ + ais + ao ), 

where the are certain rational expressions in the Uj. These critical points 
are extraneous. They can be explained by noticing that the parameterization 
is singular when either s = t or the mixing parameter vr equals or 1. 

After saturating out these three extraneous components we are left with 
an ideal which is prime over Q(m) . It is artinian and has 24 complex zeros. 
These critical points come in pairs (vr, s,t) and (1 — 7r,t, s). Removing this 
extra symmetry confirms that the true ML degree of this model is 12. □ 

This example suggests that we add one more step to Algorithm ITHl 
Step 4: Let Q be the ideal generated by the d x d minors of the (n + l) x d 
Jacobian matrix Df = [dfi/dOj). Compute and output the saturation 

Ku := {Ju-.Qn- (12) 

The variety V{Q) is the singular locus of the map /, and the saturation 
p2p removes all components of the ideal Ju that lie in this singular locus. 
We close by relating the ideal Ku to the ideal J„ from Sections 2-4. 

Theorem 20. Let f : ^ M"+^ he a polynomial map whose image is 
defined by a homogeneous ideal P as in Section 2. Suppose that f is gener- 
ically finite of degree 6, and the image of Uf\V{Q) lies in the smooth locus 
of V = V{P). For generic u G N"'^"'^, the variety V{Ku) equals the preimage 
ofV{Iu)- In particular, is artinian and its colength is 5 times the ML 
degree of V . 
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Proof. Let g = {gi, g2, . . . , gr) be the generators of P. Then we have gof = 0. 
The Chain Rule imphes J ■ Df = 0, where J is the Jacobian of V^fj as in (@)). 
The smoothness hypotheses guarantee that the rank of J is n+l—d, while the 
rank of Df is d, for all points p = f{6) where 6 G Uf\V{Q). The dimension 
count shows that the image of equals the kernel of Df^. More precisely, 
a vector u lies in the kernel of Df{6)^ if and only if it lies in the image of 
J{p)^ with p = f{6). In light of Propositions |21 and |21 this implies that, for 
u generic, every point p G V(/„) pulls back to 6 points 9 G V{Ku)- □ 
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